pandas practice: creating and working with a hierarchical index (MultiIndex), part 1

January 22, 2019

Contents

1. Creating a MultiIndex
2. MultiIndex.names
3. Using a MultiIndex as column labels
4. Getting the values of each level
5. Selecting data
6. Selecting rows
7. Data alignment

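Hierarchical (multi-level) indexing is exciting because it opens the door to some quite sophisticated data analysis and manipulation, especially for higher-dimensional data. In essence, it lets you store and manipulate data with an arbitrary number of dimensions in lower-dimensional data structures such as Series (1d) and DataFrame (2d).

In this article you will learn what a MultiIndex is and how to create and work with one.

Creating a MultiIndex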

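A MultiIndex object is an extension of the standard Index object. You can think of a MultiIndex as a list of tuples in which each tuple is unique, whereas an ordinary Index can be thought of as a list of numbers or strings. A MultiIndex can be created from a list of arrays (using MultiIndex.from_arrays), a list of tuples (using MultiIndex.from_tuples), or the cross product of several iterables (using MultiIndex.from_product). When the Index constructor is passed a list of tuples, it will attempt to return a MultiIndex. The following examples show the different ways to create a MultiIndex.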
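As a small illustration of the Index constructor behaviour mentioned above, the sketch below passes a list of tuples to the plain pd.Index constructor and inspects what comes back. The names pairs and idx are illustrative only.

```python
import pandas as pd

# Illustrative data: each tuple is one (first, second) key.
pairs = [('bar', 'one'), ('bar', 'two'), ('baz', 'one')]

# The plain Index constructor is given tuples rather than scalars.
idx = pd.Index(pairs)

print(type(idx).__name__)  # class name of the returned object
print(idx.nlevels)         # number of levels inferred from the tuple length
```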

from_tuples

First, build a list of tuples:

import pandas as pd

arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
            ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]

tuples = list(zip(*arrays))
tuples

Output (plain):
[('bar', 'one'),
('bar', 'two'),
('baz', 'one'),
('baz', 'two'),
('foo', 'one'),
('foo', 'two'),
('qux', 'one'),
('qux', 'two')]

Use from_tuples to create the MultiIndex:

index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
index

Output (plain):
MultiIndex(levels=[['bar', 'baz', 'foo', 'qux'], ['one', 'two']],
labels=[[0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 0, 1, 0, 1, 0, 1]],
names=['first', 'second'])
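The output above comes from an older pandas release that prints the integer positions as labels. The following sketch assumes pandas 0.24 or later, where the same information is exposed as the codes attribute, and shows how levels and codes together encode the tuples. The helper variable rebuilt is only for illustration.

```python
import pandas as pd

tuples = [('bar', 'one'), ('bar', 'two'), ('baz', 'one'), ('baz', 'two')]
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])

# Unique labels for each level.
levels = [list(level) for level in index.levels]

# Integer positions into those labels, one array per level.
# Assumption: pandas 0.24+, where this attribute is called `codes`.
codes = [code.tolist() for code in index.codes]

# Rebuild the tuples by looking each code up in its level.
rebuilt = [tuple(levels[i][c] for i, c in enumerate(row)) for row in zip(*codes)]
print(rebuilt == tuples)
```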

Create a Series and set its index:

import numpy as np
s = pd.Series(np.random.randn(8), index=index)
s

Output (plain):
first second
bar one 1.457857
two 0.999506
baz one -1.556818
two 1.716127
foo one -1.562564
two 0.313624
qux one 0.537644
two 1.178401
dtype: float64

from_arrays

If from_tuples takes a list of "rows", then from_arrays takes a list of "columns":

arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
            ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]

index = pd.MultiIndex.from_arrays(arrays)
s = pd.Series(np.random.randn(8), index=index)
s

Output (plain):
bar one -1.754944
two 1.111560
baz one -1.291416
two 1.556595
foo one 0.147699
two 1.379124
qux one -0.981192
two 0.292709
dtype: float64

For convenience, though, you can usually pass the list of arrays directly to the Series constructor:

arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
            ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]

s = pd.Series(np.random.randn(8), index=arrays)
s

Output (plain):
bar one 0.507618
two -2.190117
baz one -0.138124
two -2.175832
foo one -0.570554
two -0.851560
qux one -0.784552
two 0.003748
dtype: float64
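The same shortcut should also work for a DataFrame. The sketch below is a minimal check under that assumption, using a smaller arrays list and a hypothetical frame variable; it only confirms that the resulting index is a MultiIndex.

```python
import numpy as np
import pandas as pd

arrays = [['bar', 'bar', 'baz', 'baz'],
          ['one', 'two', 'one', 'two']]

# Passing the list of arrays as index lets pandas build the MultiIndex itself.
frame = pd.DataFrame(np.zeros((4, 2)), index=arrays, columns=['x', 'y'])

print(isinstance(frame.index, pd.MultiIndex))
print(frame.index.names)  # this shortcut does not set level names
```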

from_product

Given two lists, pairing every element of the first with every element of the second yields the product of the two lists:

lists = [['bar', 'baz', 'foo', 'qux'], ['one', 'two']]

index = pd.MultiIndex.from_product(lists, names=['first', 'second'])
s = pd.Series(np.random.randn(len(index)), index=index)
s

Output (plain):
first second
bar one -1.209379
two 0.497207
baz one 0.592290
two -0.769594
foo one -0.935071
two 0.201014
qux one -0.176715
two -0.183346
dtype: float64
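To tie the three constructors together, here is a short sketch that builds the same small index three ways and compares the results. The reduced label lists and the variable names are chosen for illustration.

```python
import pandas as pd

names = ['first', 'second']

# Cross product of the two label lists.
from_product = pd.MultiIndex.from_product(
    [['bar', 'baz'], ['one', 'two']], names=names)

# One array per level ("columns").
from_arrays = pd.MultiIndex.from_arrays(
    [['bar', 'bar', 'baz', 'baz'], ['one', 'two', 'one', 'two']], names=names)

# One tuple per entry ("rows").
from_tuples = pd.MultiIndex.from_tuples(
    [('bar', 'one'), ('bar', 'two'), ('baz', 'one'), ('baz', 'two')], names=names)

same = from_product.equals(from_arrays) and from_arrays.equals(from_tuples)
print(same)
```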

MultiIndex.names

You can name each level of a MultiIndex; the names are stored in the names attribute:

# no names have been set yet
s.index.names

Output (plain):
FrozenList([None, None])

s.index.names = ['FirstLevel', 'SecondLevel']

s.index

Output (plain):
MultiIndex(levels=[['bar', 'baz', 'foo', 'qux'], ['one', 'two']],
labels=[[0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 0, 1, 0, 1, 0, 1]],
names=['FirstLevel', 'SecondLevel'])
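Assigning to names changes the index in place. If you would rather leave the original untouched, set_names returns a renamed copy instead. The sketch below shows both the full rename and renaming a single level; the variable names are illustrative.

```python
import pandas as pd

index = pd.MultiIndex.from_product([['bar', 'baz'], ['one', 'two']])

# set_names returns a renamed copy by default; `index` itself is untouched.
named = index.set_names(['FirstLevel', 'SecondLevel'])

# A single level can be renamed by position with the level argument.
partly_named = index.set_names('FirstLevel', level=0)

print(list(index.names))
print(list(named.names))
print(list(partly_named.names))
```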

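Using a MultiIndex as column labels

A DataFrame stores its column labels in the columns attribute, which can also be a MultiIndex object (a Series has no columns attribute):

df = pd.DataFrame(np.random.randn(3, 8), index=['A', 'B', 'C'], columns=index)
df

Output (html):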

FirstLevel	bar		baz		foo		qux
SecondLevel	one	two	one	two	one	two	one	two
A	1.014034	1.464162	-0.753476	-0.163394	-0.198990	0.116046	-0.555008	-0.965797
B	-1.314244	-1.263142	1.523974	0.541391	-0.217874	0.019695	1.188791	-1.003912
C	-1.175155	-0.587370	0.352587	-0.047469	-0.896502	0.280792	0.806554	0.132662

Getting the values of each level

The get_level_values method returns the vector of labels at every position for a particular level:

index.get_level_values(0)

Output (plain):
Index(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'], dtype='object', name='FirstLevel')

If you have named the levels of the index, you can fetch a level's values by name directly:

index.get_level_values('FirstLevel')

Output (plain):
Index(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'], dtype='object', name='FirstLevel')
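Because get_level_values returns one label per position, its result can be compared elementwise and used as a boolean mask. The sketch below filters a small Series this way; the data and the names mask and selected are made up for the example.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product(
    [['bar', 'baz'], ['one', 'two']], names=['first', 'second'])
s = pd.Series(np.arange(4), index=index)

# One label per row for the chosen level, so it can be compared elementwise.
mask = s.index.get_level_values('second') == 'one'

# Boolean indexing keeps both index levels on the result.
selected = s[mask]
print(selected)
```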

Selecting data

This is probably one of the most important features of a MultiIndex.

First, look at the structure of our df:

df

Output (html):

FirstLevel	bar		baz		foo		qux
SecondLevel	one	two	one	two	one	two	one	two
A	1.014034	1.464162	-0.753476	-0.163394	-0.198990	0.116046	-0.555008	-0.965797
B	-1.314244	-1.263142	1.523974	0.541391	-0.217874	0.019695	1.188791	-1.003912
C	-1.175155	-0.587370	0.352587	-0.047469	-0.896502	0.280792	0.806554	0.132662

Get all the data where FirstLevel is bar:

df['bar']

Output (html):

SecondLevel	one	two
A	1.014034	1.464162
B	-1.314244	-1.263142
C	-1.175155	-0.587370

Get all the data where FirstLevel is bar and SecondLevel is one:

df['bar', 'one']

Output (plain):
A 1.014034
B -1.314244
C -1.175155
Name: (bar, one), dtype: float64

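I prefer writing it this way because the intent is more explicit (this is chained indexing, so use it only to read values, not to assign them):

df['bar']['one']

Output (plain):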
A 1.014034
B -1.314244
C -1.175155
Name: one, dtype: float64

Note that selecting a first-level label changes the columns of the result, which now hold only the second level:

df['bar'].columns

Output (plain):
Index(['one', 'two'], dtype='object', name='SecondLevel')

To select all the data whose second-level column label is one, you need the xs method:

df.xs('one', level=1, axis=1)

Output (html):

FirstLevel	bar	baz	foo	qux
A	1.014034	-0.753476	-0.198990	-0.555008
B	-1.314244	1.523974	-0.217874	1.188791
C	-1.175155	0.352587	-0.896502	0.806554

Or use names instead of numbers:

df.xs('one', level='SecondLevel', axis='columns')

Output (html):

FirstLevel	bar	baz	foo	qux
A	1.014034	-0.753476	-0.198990	-0.555008
B	-1.314244	1.523974	-0.217874	1.188791
C	-1.175155	0.352587	-0.896502	0.806554
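By default xs drops the level it matched on, which is why only FirstLevel remains in the output above. If you want to keep the full two-level labels, xs accepts drop_level=False. A minimal sketch on a small illustrative frame:

```python
import numpy as np
import pandas as pd

columns = pd.MultiIndex.from_product(
    [['bar', 'baz'], ['one', 'two']], names=['first', 'second'])
df = pd.DataFrame(np.arange(12).reshape(3, 4), index=['A', 'B', 'C'], columns=columns)

# Default: the matched level is dropped from the result.
dropped = df.xs('one', level='second', axis=1)

# drop_level=False keeps the full two-level column labels.
kept = df.xs('one', level='second', axis=1, drop_level=False)

print(dropped.columns.nlevels, kept.columns.nlevels)
print(list(kept.columns))
```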

I like xs because it can select rows as well as columns. Take the Series s again:

s

Output (plain):
FirstLevel SecondLevel
bar one -1.754944
two 1.111560
baz one -1.291416
two 1.556595
foo one 0.147699
two 1.379124
qux one -0.981192
two 0.292709
dtype: float64

s.xs('one', level='SecondLevel', axis='index')

Output (plain):
FirstLevel
bar -1.754944
baz -1.291416
foo 0.147699
qux -0.981192
dtype: float64

Selecting rows

Now let's transpose df and look at some row-selection operations:

df = df.T

df

Output (html):

		A	B	C
FirstLevel	SecondLevel
bar	one	1.014034	-1.314244	-1.175155
bar	two	1.464162	-1.263142	-0.587370
baz	one	-0.753476	1.523974	0.352587
baz	two	-0.163394	0.541391	-0.047469
foo	one	-0.198990	-0.217874	-0.896502
foo	two	0.116046	0.019695	0.280792
qux	one	-0.555008	1.188791	0.806554
qux	two	-0.965797	-1.003912	0.132662

Select the data where FirstLevel is bar and SecondLevel is two:

df.loc[('bar', 'two')]

Output (plain):
A 1.464162
B -1.263142
C -0.587370
Name: (bar, two), dtype: float64

The following returns the same values (only the Name of the result differs):

df.loc['bar'].loc['two']

Output (plain):
A 1.464162
B -1.263142
C -0.587370
Name: two, dtype: float64

You can select columns at the same time as rows:

df.loc[('bar', 'two'), 'A']

Output (plain):
1.4641615518687836

Slicing works too:

df.loc['baz':'foo']

Output (html):

		A	B	C
FirstLevel	SecondLevel
baz	one	-0.753476	1.523974	0.352587
baz	two	-0.163394	0.541391	-0.047469
foo	one	-0.198990	-0.217874	-0.896502
foo	two	0.116046	0.019695	0.280792

Perhaps more commonly, you will slice between two full keys:

df.loc[('bar', 'two'):('baz', 'two')]

Output (html):

		A	B	C
FirstLevel	SecondLevel
bar	two	1.464162	-1.263142	-0.587370
baz	one	-0.753476	1.523974	0.352587
baz	two	-0.163394	0.541391	-0.047469
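Label slicing like this relies on the index being sorted; on an unsorted MultiIndex pandas can raise an UnsortedIndexError. The sketch below uses a deliberately shuffled, made-up index and sorts it with sort_index before slicing.

```python
import numpy as np
import pandas as pd

# Deliberately unsorted keys (illustrative data).
index = pd.MultiIndex.from_tuples(
    [('foo', 'two'), ('bar', 'one'), ('baz', 'two'),
     ('bar', 'two'), ('baz', 'one'), ('foo', 'one')],
    names=['first', 'second'])
df = pd.DataFrame({'A': np.arange(6)}, index=index)

# Sort the rows lexicographically by (first, second) before slicing.
df_sorted = df.sort_index()
print(df_sorted.index.is_monotonic_increasing)

# Both endpoints of a label slice are included.
subset = df_sorted.loc[('bar', 'two'):('baz', 'two')]
print(subset)
```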

Of course, I still recommend xs: it makes your code easier for others to understand, and it selects rows and columns in the same uniform way:

df.xs('two', level='SecondLevel', axis='index')

Output (html):

	A	B	C
FirstLevel
bar	1.464162	-1.263142	-0.587370
baz	-0.163394	0.541391	-0.047469
foo	0.116046	0.019695	0.280792
qux	-0.965797	-1.003912	0.132662
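For comparison, pandas also offers pd.IndexSlice for selecting on an inner level with .loc. This is an alternative to xs rather than something the article itself uses. The sketch below, on a small made-up frame, selects the same rows both ways.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product(
    [['bar', 'baz'], ['one', 'two']], names=['first', 'second'])
df = pd.DataFrame(np.arange(8).reshape(4, 2), index=index, columns=['A', 'B'])

idx = pd.IndexSlice

# Every first-level label, but only 'two' on the second level.
# The index built above is already sorted, which this kind of slicing relies on.
by_slice = df.loc[idx[:, 'two'], :]

# xs with drop_level=False keeps both levels, so the two results are comparable.
by_xs = df.xs('two', level='second', drop_level=False)

print(by_slice)
print(by_xs)
```

Unlike the default xs call, the .loc form keeps both index levels on the result.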

Data alignment

If you need to do arithmetic on your data, a well-chosen index brings a lot of convenience. Here is s once more:

s

Output (plain):
FirstLevel SecondLevel
bar one -1.754944
two 1.111560
baz one -1.291416
two 1.556595
foo one 0.147699
two 1.379124
qux one -0.981192
two 0.292709
dtype: float64

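Suppose we only want to operate on the data after the second element; pandas aligns the data automatically by index, so labels missing from one operand come out as NaN:

s + s[2:]

Output (plain):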
FirstLevel SecondLevel
bar one NaN
two NaN
baz one -2.582832
two 3.113191
foo one 0.295398
two 2.758247
qux one -1.962383
two 0.585417
dtype: float64

The following is perhaps more useful:

s + s[::2]

Output (plain):
FirstLevel SecondLevel
bar one -3.509889
two NaN
baz one -2.582832
two NaN
foo one 0.295398
two NaN
qux one -1.962383
two NaN
dtype: float64
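If the NaN values are not what you want, the add method takes a fill_value that stands in for labels missing from one side. The sketch below uses a small Series with made-up values and an explicit positional slice via iloc.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product(
    [['bar', 'baz'], ['one', 'two']], names=['first', 'second'])
s = pd.Series(np.arange(4, dtype=float), index=index)

every_other = s.iloc[::2]  # explicit positional slice: rows 0 and 2

# Plain addition: labels missing from one side become NaN.
plain = s + every_other

# add() with fill_value treats the missing side as 0 instead.
filled = s.add(every_other, fill_value=0)

print(plain.isna().tolist())
print(filled.tolist())
```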

Note
This article was converted from a Jupyter notebook; you can download the notebook here.